DATA SCIENCE LIFECYCLE

PROBLEM STATEMENT

Access to safe drinking water is essential to health, a basic human right, and a component of effective policy for health protection. It is important as a health and development issue at every level. In some regions, it has been shown that investments in water supply and sanitation can yield a net economic benefit. Based on the features given, we have to determine whether the water is potable (safe to drink) or not.

1-POTABLE 0-NOT POTABLE

FEATURES

1.PH VALUE(ph)

pH is an important parameter in evaluating the acid–base balance of water. It is also the indicator of the acidic or alkaline condition of water. WHO has recommended a permissible pH range of 6.5 to 8.5. The current investigation ranges were 6.52–6.83, which are within the WHO standards.

2.Hardness

Hardness is mainly caused by calcium and magnesium salts. These salts are dissolved from geologic deposits through which water travels. The length of time water is in contact with hardness-producing material helps determine how much hardness there is in raw water. Hardness was originally defined as the capacity of water to precipitate soap, which is caused by calcium and magnesium.

3.Total dissolved solids(Solids)

Water has the ability to dissolve a wide range of inorganic and some organic minerals or salts such as potassium, calcium, sodium, bicarbonates and chlorides. Total dissolved solids (TDS) is an important parameter for the use of water. Water with a high TDS value is highly mineralized. The desirable limit for TDS is 500 mg/L and the maximum limit is 1000 mg/L, which is prescribed for drinking purposes.

4.Chloramines

Chlorine and chloramine are the major disinfectants used in public water systems. Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 milligrams per liter (mg/L or 4 parts per million (ppm)) are considered safe in drinking water.

5.Sulfate

Sulfates are naturally occurring substances that are present in minerals, rocks and salts. The sulfate concentration in seawater is about 2,700 milligrams per liter (mg/L).

6. Conductivity

Pure water is not a good conductor of electricity; rather, it is a good insulator. Generally, the amount of dissolved salts determines the conductivity of water.

7. Organic_Carbon

Total organic carbon (TOC) is a measure of the total amount of carbon in organic compounds in pure water. According to the US EPA, TOC should be < 2 mg/L in treated / drinking water and < 4 mg/L in source water which is used for treatment.

8.Trihalomethanes

Trihalomethanes (THMs) are chemicals which may be found in water treated with chlorine. The concentration of THMs depends on the amount of dissolved organic carbon. THM levels up to 80 ppm are considered safe.
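The limits quoted in the feature descriptions can be collected in one place and used to flag a sample. This is only a sketch: the dictionary keys assume the dataset column names are spelled as in this notebook, the function name `out_of_range` is ours, and the sample values are made up for illustration.

```python
# Guideline limits as quoted in the feature descriptions above.
# Keys are the assumed dataset column names; (low, high) with None = no bound.
LIMITS = {
    "ph": (6.5, 8.5),                 # WHO permissible range
    "Solids": (None, 1000.0),         # TDS maximum limit, mg/L
    "Chloramines": (None, 4.0),       # mg/L (ppm)
    "Organic_carbon": (None, 2.0),    # US EPA, treated / drinking water, mg/L
    "Trihalomethanes": (None, 80.0),  # limit as quoted above
}


def out_of_range(sample):
    """Return the names of the features in `sample` that fall outside LIMITS."""
    flagged = []
    for name, (low, high) in LIMITS.items():
        value = sample.get(name)
        if value is None:
            continue  # feature not measured for this sample
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high:
            flagged.append(name)
    return flagged


# Hypothetical sample, invented for illustration only.
sample = {"ph": 9.1, "Solids": 850.0, "Chloramines": 3.2, "Trihalomethanes": 95.0}
flagged = out_of_range(sample)
print(flagged)
```

A check like this is a rule of thumb, not a replacement for the model: a sample can be inside every limit and still be labelled non-potable in the data.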

HYPOTHESIS CREATION

1.Potable water has a ph in the range 7-8.

2.The more solids, the lower the purity of the water and hence the less potable it is.

3.The higher the hardness, the lower the potability.

4.Hardness comes from dissolved minerals such as calcium and magnesium, which are metals, so conductivity increases as hardness increases.

5.The more Organic_carbon, the higher the ph.

6.Turbidity increases the conductivity of water.

7.Conductivity decreases as Chloramines increase.

8.The more Organic_carbon, the lower the potability.

9.The higher the ph, the greater the chance that the water is not potable.

10.If the quantity of Trihalomethanes is above 80 ppm, the water is less likely to be potable.

11.The more sulfate present, the lower the potability.

12.The more sulfate, the higher the conductivity.

13.The more total salts, the lower the potability.

14.The more total salts, the higher the conductivity.

15.The more Solids, the higher the Hardness.
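Hypotheses of this kind can be checked with group means (feature against potability) and pairwise correlations (feature against feature). The sketch below shows the idea for hypotheses 2 and 4. The column names, including the target name `Potability`, are assumptions, and the six rows are made up, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Tiny made-up table for illustration; not the real data.
df = pd.DataFrame({
    "Solids":       [12000.0, 30000.0, 18000.0, 25000.0, 15000.0, 28000.0],
    "Hardness":     [150.0, 220.0, 180.0, 210.0, 160.0, 230.0],
    "Conductivity": [300.0, 520.0, 380.0, 470.0, 330.0, 540.0],
    "Potability":   [1, 0, 1, 0, 1, 0],
})

# Hypothesis 2: compare mean Solids for potable (1) and non-potable (0) water.
mean_solids = df.groupby("Potability")["Solids"].mean()

# Hypothesis 4: Pearson correlation between Hardness and Conductivity.
corr_hardness_conductivity = df["Hardness"].corr(df["Conductivity"])

print(mean_solids)
print(corr_hardness_conductivity)
```

On the real data the same two lines apply unchanged; a hypothesis is supported only if the group means differ in the expected direction or the correlation has the expected sign and a meaningful size.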

Exploratory data analysis

Missing Value Imputation
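The imputation code is not shown in this write-up, so the following is only a sketch of one common approach: counting missing values per column and filling them with the column median. The choice of the median, the column names and the values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up frame with gaps; column names are assumed.
df = pd.DataFrame({
    "ph":         [7.0, np.nan, 6.5, 8.1],
    "Sulfate":    [330.0, 310.0, np.nan, 350.0],
    "Potability": [0, 1, 0, 1],
})

# Count missing values per column before imputation.
missing_before = df.isnull().sum()

# Fill each numeric column with its own median (an assumed strategy).
df_filled = df.fillna(df.median(numeric_only=True))

missing_after = df_filled.isnull().sum()
print(missing_before)
print(missing_after)
```

The median is used here because it is less sensitive to extreme values than the mean, which matters for columns with many outliers.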

CHECKING DIMENSIONALITY REDUCTION USING HEATMAP

Report

1.None of the columns are strongly correlated.

2.The highest correlation is 7.6%, between ph and Hardness.

3.Many features or columns have negative correlations as well.
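The numbers behind a correlation heatmap come from the pairwise correlation matrix. The sketch below computes that matrix and picks the strongest off-diagonal pair. The data is random placeholder data with assumed column names, so the pair it finds is meaningless; on the real frame, `corr` could be passed to a plotting call such as `seaborn.heatmap(corr, annot=True)`.

```python
import numpy as np
import pandas as pd

# Random placeholder data; only the column names mimic the dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(200, 4)),
    columns=["ph", "Hardness", "Solids", "Sulfate"],
)

# Pearson correlation between every pair of columns.
corr = df.corr()

# Look only at the upper triangle so each pair is counted once
# and the diagonal (always 1.0) is skipped.
values = corr.to_numpy()
i_idx, j_idx = np.triu_indices(len(corr.columns), k=1)
best = np.argmax(np.abs(values[i_idx, j_idx]))

strongest_pair = (corr.columns[i_idx[best]], corr.columns[j_idx[best]])
strongest_value = values[i_idx[best], j_idx[best]]
print(strongest_pair, strongest_value)
```

If even the strongest pair is weakly correlated, as reported above, there is no redundant column to drop and all features are kept.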

Outlier Detection

Report

1.The median of Solids is around 21000.

2.Most of the points are above 4500.

3.There are many outlier points.

Report

1.The 50th percentile, or median, of ph is 7.

2.There are many outliers on both sides: some points have ph below 4 and some have ph above 10.
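The points a box plot draws as outliers are usually those lying more than 1.5 times the interquartile range (IQR) outside the quartiles. A minimal sketch of that rule follows; the helper name `iqr_outliers` and the ph values are made up for illustration, and we assume the default 1.5 factor.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Made-up ph readings: mostly near 7, with one very low and one very high value.
ph = pd.Series([6.8, 7.0, 7.1, 6.9, 7.2, 7.3, 6.7, 3.5, 10.8, 7.0])
outliers = iqr_outliers(ph)
print(outliers.tolist())
```

Applied column by column to the real frame, the same helper gives the number of outlier points per feature.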

Report

1.There are more non-drinkable (non-potable) samples than potable ones.

2.Non-potability depends on various factors such as carbon content, hardness, sulfates and ph.

3.Hence several columns or features reduce the quality of water.

Report

1.A ph value of 7 has the lowest conductivity, and that point is non-potable.

2.A ph of 0, which is acidic, has conductivity near 600 and is non-potable.

3.Potable water has a ph between 5 and 9.

4.The highest conductivity is more than 700, and that point is non-potable.

5.A point with ph of 13 is potable, which is an outlier.

Report

1.A point with sulfate less than 150 has a conductivity of 550 and is potable.

2.Most of the water samples have sulfate between 250 and 400.

3.The lowest conductivity is less than 200, at a sulfate of 340.

4.There is one outlier with sulfate of more than 450, and the water is potable in that case.

Hence sulfate does not contribute much to the conductivity.

1.Most of the potable water has turbidity between 3 and 7.

2.The most turbid water is non-potable.

3.A turbidity of 3 has the highest conductivity.

4.Some points with a turbidity of 6 are still drinkable.

Report

1.Data is left skewed.

2.Most of the solids are in the range 5000 to 30000.

3.Non-potable water has the highest conductivity, more than 700, which is an outlier.

Report

1.There are two orange points (potable) with chloramines of 0.

2.Chloramines of 8 has the highest conductivity, more than 700, which is an outlier.

3.Most points lie between chloramines of 4 and 10.

4.Some orange points (potable) have chloramines of 12 or greater.

5.The lowest conductivity is at a ph value of 7.

Report

1.Organic_carbon of 0 is potable water, which means it is correctly classified.

2.Organic_carbon of more than 25 is non-potable.

3.Organic_carbon doesn't contribute much to deciding the potability.

Report

1.There are more potable water points than non-potable ones.

2.Sulfate decides the potability of water.

3.Sulfate of less than 150 is classified as potable water.

1.Blue points (non-potable) are not as widely distributed as orange points (potable).

2.Even with more sulfate there is an equal chance that water is potable.

3.Most non-potable points (blue) have sulfate between 250 and 400.

4.Even a point with ph of 13 and sulfate of 360 is classified as potable water.

Report

1.Sulfate of 0 has a conductivity of 550.

2.Sulfate of almost 350 is classified as non-potable at both extremes.

3.At 340 there are many overlapping orange points arranged in a sequence, which means sulfate of around 340 has the most orange (potable) points.

Report

1.Most of the non-potable points (blue) have hardness in the range 150 to 250.

2.Potable water points (orange), on the other hand, are widely distributed.

3.Hardness greater than 300 has mostly orange points (which means the water is potable).

4.This shows that our hypothesis (the higher the hardness, the lower the potability) is wrong.

Report

1.For potable water (orange points) hardness varies a lot.

2.Even hardness above 300 is classified as potable water.

3.Potable water is more widely distributed than non-potable water.

Report

1.Most potable water lies in the ph range 5 to 9.

2.No orange point (potable) has a carbon content of more than 24.

3.A carbon content of 0 with ph of 5 is potable.

Report

1.Non-potable water has hardness greater than 100.

2.Most potable water has ph between 5 and 9, whereas non-potable water is widely distributed.

3.A point with hardness of almost 50 and ph of 13 is potable.

Report

1.It clearly shows that ph is normally distributed.

2.It also reflects that most of the water samples have a ph of 7.

Separating Target variable and Predictor variables

Train Test split to fit the model and check its performance

Standardizing the X_train and X_test
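The three preparation steps named above can be sketched as follows. The data is random placeholder data, the target column name `Potability`, the 80/20 split and the stratification are assumptions; the variable names `X_train` and `X_test` follow the heading.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random placeholder data with assumed column names.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "ph": rng.normal(7.0, 1.5, 100),
    "Hardness": rng.normal(200.0, 30.0, 100),
    "Potability": rng.integers(0, 2, 100),
})

# 1. Separate the predictors (X) from the target (y).
X = df.drop(columns="Potability")
y = df["Potability"]

# 2. Hold out 20% of the rows for testing, keeping the class ratio similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Fit the scaler on the training rows only, then apply it to both sets,
#    so that no information from the test set leaks into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Standardizing matters for logistic regression; tree based models split on thresholds and are not affected by the scale of a feature.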

Which model do you think would be most appropriate and why?

Our data set has outliers on which we cannot perform outlier treatment, because the meaning of the variable would change completely. For example, the variable 'ph' has outliers. If we treat them by dropping, many observations will be lost, which we cannot afford. If we treat them by winsorization, a 'ph' of 14 (an outlier), which is basic in nature, is converted into a 'ph' of about 8, which is close to neutral, and the whole meaning changes. The same is true for the other variables, so we cannot treat the outliers.
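The winsorization problem can be seen on a handful of values. In this sketch the ph readings are made up, and clipping at the 10th and 90th percentiles is an assumed choice of cut-offs.

```python
import numpy as np

# Made-up ph readings with one strongly acidic and one strongly basic sample.
ph = np.array([6.8, 7.0, 7.1, 6.9, 7.2, 7.3, 6.7, 0.5, 14.0, 7.0])

# Winsorize: clip everything below the 10th and above the 90th percentile.
low, high = np.percentile(ph, [10, 90])
ph_winsorized = np.clip(ph, low, high)

print(ph[8], "->", ph_winsorized[8])
print(ph[7], "->", ph_winsorized[7])
```

With these values the reading of 14 is pulled down to just under 8 and the reading of 0.5 is pulled up to about 6, so two chemically extreme samples end up looking almost neutral.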

So we need to select a model which is robust to outliers.

Tree based models are known to be robust to outliers. Examples are the decision tree and the random forest classifier, which are appropriate options for our dataset: tree algorithms split the data points by comparing a feature against a threshold value, so the magnitude of an outlier does not affect the split much.
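A small experiment illustrates the point about thresholds. We fit the same decision tree twice on a made-up single feature, once as is and once with the largest value made ten times larger. The values, labels and tree settings are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up single feature (think of Solids) and made-up labels.
X = np.array([[15000.0], [18000.0], [20000.0], [21000.0],
              [22000.0], [30000.0], [32000.0], [61000.0]])
y = np.array([1, 1, 1, 1, 1, 0, 0, 0])

# Same data, but the largest value is made ten times more extreme.
X_extreme = X.copy()
X_extreme[-1, 0] = 610000.0

tree_a = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
tree_b = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_extreme, y)

# The root split sits between 22000 and 30000 in both trees, because only
# the ordering of the values matters, not how far away the outlier lies.
root_threshold_a = tree_a.tree_.threshold[0]
root_threshold_b = tree_b.tree_.threshold[0]
print(root_threshold_a, root_threshold_b)
```

A model such as logistic regression, which uses the feature value itself rather than its rank, would be pulled towards the extreme point instead.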

List of models we have tried:

1-Decision tree

2-Decision tree with parameter tuning

3-Random forest

4-Random forest with parameter tuning

5-Logistic regression (we know it is affected by outliers, but we still tried it)

6-AdaBoost

7-AdaBoost for decision tree with hyper parameter tuning

Model fitting

model1 DT DECISION TREE

model2 DECISION TREE WITH HYPER PARAMETER TUNING (WITH ENTROPY)

model3 DT (DECISION TREE WITH HYPER PARAMETER TUNING WITH GINI)

TRYING WITH LOGISTIC REGRESSION

USING ENSEMBLE MODEL RANDOM FOREST

RANDOM FOREST WITH HYPER PARAMETER TUNING
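The tuning code is not shown here, so this is only a sketch of how a random forest can be tuned with a grid search. The data comes from `make_classification` as a stand-in for the water data, and the parameter grid, the 3-fold cross-validation and the split are assumptions, not the settings behind the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with 9 features; not the water dataset.
X, y = make_classification(n_samples=300, n_features=9, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Assumed search space: every combination is tried with cross-validation.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Best combination found, and accuracy of the refitted model on held-out data.
best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params, test_accuracy)
```

Because false positives matter most in this problem, a different `scoring` choice (for example precision on the potable class) may be worth comparing with plain accuracy.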

To select the best model we compare the accuracy and the false positive rate (FPR) of each model. The reason for looking at FPR is that if water is not fit for drinking but our model predicts it as drinkable, the effect on the user is worse than if the model wrongly predicts drinkable water as non-drinkable.

Based on these metrics, Random forest with hyper parameter tuning has the highest accuracy among all the models, 66.53%, and also the lowest FPR, 7.8%.
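Both metrics can be read off a confusion matrix, with FPR = FP / (FP + TN), where a false positive is non-potable water predicted as potable. The labels and predictions below are made up to show the calculation and are unrelated to the results reported above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 1 = potable, 0 = not potable.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0, 1, 0, 1, 0]

# With labels=[0, 1], ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

accuracy = accuracy_score(y_true, y_pred)

# Share of truly non-potable samples that the model wrongly calls potable.
fpr = fp / (fp + tn)

print(accuracy, fpr)
```

Here one of the six non-potable samples is predicted as potable, so the FPR is 1/6, while the accuracy is 7 correct predictions out of 10.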

Our hypothesis that the Random forest ML algorithm will perform better has been found true.

Why we think random forest performs better than the other models

We think the random forest classifier performs better than the other selected models because it is robust against outliers, and because it is an ensemble learning technique based on the bagging algorithm.

It creates many trees, each on a subset of the data, and combines the output of all the trees. In this way it reduces the overfitting problem of decision trees and also reduces the variance, and therefore improves the accuracy.

It uses column and row sampling: each tree sees only a subset of the data points and features, but the overall aggregate remains almost the same while the variance reduces.